Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving Usability of ASCII Strings #1304

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

nguyenv
Copy link
Collaborator

@nguyenv nguyenv commented Aug 30, 2022

  • Previously TILEDB_STRING_ASCII data was inconsistently displayed
    as bytes
  • There is a need to coerce to str everywhere because (1) previously
    the resulting dataframe displayed ASCII as bytes with Pyarrow disabled
    but as str with Pyarrow enabled, and (2) this fix would remove the
    need to copy large amounts of data to convert back and forth in the
    TileDB-SingleCell Python API
  • Warning now emitted to the user to pass dtype="ascii" for string dim
    types in lieu of np.bytes_ or np.str_ for clarity. All three still
    work and under the hood use np.str_ and TILEDB_STRING_ASCII
  • repr of string dimensions is now always displayed as dtype="ascii".
    Calling .dtype() will return np.dtype('U') as the return signature
    of dtype requires a Numpy dtype

@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #20965: Unicode-to-ASCII and ASCII-to-Unicode in TileDB-Py.

@nguyenv nguyenv requested review from ihnorton and johnkerl August 30, 2022 15:42
@nguyenv nguyenv marked this pull request as draft August 30, 2022 15:56
* Previously `TILEDB_STRING_ASCII` data was inconsistently displayed
  as `bytes`
* There is a need to coerce to `str` everywhere because (1) previously
  the resulting dataframe displayed ASCII as bytes with Pyarrow disabled
  but as str with Pyarrow enabled, and (2) this fix would remove the
  need to copy large amounts of data to convert back and forth in the
  TileDB-SingleCell Python API
* Warning now emitted to the user to pass `dtype="ascii"` for string dim
  types in lieu of `np.bytes_` or `np.str_` for clarity. All three still
  work and under the hood use `np.str_` and `TILEDB_STRING_ASCII`
* `repr` of string dimensions is now always displayed as `dtype="ascii"`.
  Calling `.dtype()` will return `np.dtype('U')` as the return signature
  of `dtype` requires a Numpy dtype
@nguyenv nguyenv force-pushed the viviannguyen/sc-20965/unicode-to-ascii-and-ascii-to-unicode-in branch from 3202aee to dead46e Compare August 30, 2022 16:26
@johnkerl
Copy link
Contributor

@nguyenv testing now -- thanks! :)

@johnkerl johnkerl marked this pull request as ready for review August 30, 2022 17:31
@johnkerl
Copy link
Contributor

@nguyenv using SOMA tiledb://johnkerl-tiledb/Krasnow, obs array
https://cloud.tiledb.com/arrays/details/johnkerl-tiledb/351316c2-a1e4-400f-828d-758517517cfb/schema

The obs_id dim is STRING_ASCII but the string attrs show up in that web-UI schema presentation as CHAR

Also, I see:

>>> obs_uri = 'tiledb://johnkerl-tiledb/351316c2-a1e4-400f-828d-758517517cfb'
>>> O = tiledb.open(obs_uri)
>>> O.df[:]
                       nGene  nUMI  channel     region  ...        sex    tissue   ethnicity           development_stage
obs_id                                                  ...
P1_1_AAACCTGAGCGATAGC    959  3583  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
P1_1_AAACCTGAGGCAAAGA   1168  3480  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
P1_1_AAACCTGGTATTCGTG   2053  7838  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
P1_1_AAACCTGGTGATGTCT   1099  3976  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
P1_1_AAACGGGAGACAATAC   1225  4403  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
...                      ...   ...      ...        ...  ...        ...       ...         ...                         ...
P3_8_TTTGTCAGTCGGCATC   1782  6362  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
P3_8_TTTGTCAGTCTAGTGT   1241  3505  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
P3_8_TTTGTCATCACATACG   1234  3442  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
P3_8_TTTGTCATCCTAGTGA   1150  2469  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
P3_8_TTTGTCATCTTGCATT   2364  8348  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
^^^^ strings                                 ^^^^^^ still bytes
[65662 rows x 29 columns]

This is (as a reminder) in reference to single-cell-data/TileDB-SOMA#99 -- please see there for the reason that we cannot store these attributes as Unicode within the TileDB storage.

@nguyenv nguyenv marked this pull request as draft August 30, 2022 20:31
@nguyenv
Copy link
Collaborator Author

nguyenv commented Aug 30, 2022

We have resolved the original problem that prompted this PR.

However, this branch still contains several features that may be important for usability such as consistent presentation of ASCII as str, explicitly displaying dtype="ascii" in Dim.repr and Attr.repr, and favoring the use of passing dtype="ascii" instead of np.str_ or np.bytes_ for dimensions.

@nguyenv nguyenv changed the title TILEDB_STRING_ASCII Now Displaying As UTF-8 / str Everywhere Improving Usability of ASCII Types Aug 30, 2022
@nguyenv nguyenv changed the title Improving Usability of ASCII Types Improving Usability of ASCII Strings Aug 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants